iscream

Reading WGBS data into analysis-ready structures

James Eapen, Jacob Morrison, Hui Shen

July 25, 2024

Whole genome bisulfite sequencing

Converting an epigenetic difference into a genetic difference


mC \(\xrightarrow{bisulfite}\) C


    C \(\xrightarrow{bisulfite}\) U

DNA methylation analysis

WGBS alignment

  • Bismark
  • BISCUIT

Methylation analysis

  • BSseq: read.bismark \(\to\) bsseq \(\to\) DMRSeq, DMRcate

  • biscuiteer: readBiscuit \(\to\) BSseq

  • scMET

Analysis

BSseq

  • read.bismark to read Bismark .cov files

  • M and Coverage count matrices

  • GRanges storing CpG loci

  • Used by DMRSeq, DMRcate

An object of type 'BSseq' with
  69349 methylation loci
  93 samples
has not been smoothed
All assays are in-memory
read.bismark(files)

Analysis

BSseq

  • read.bismark to read Bismark .cov files

  • M and Coverage count matrices

  • GRanges storing CpG loci

  • Used by DMRSeq, DMRcate

An object of type 'BSseq' with
  69349 methylation loci
  93 samples
has not been smoothed
All assays are in-memory
gr <- GRanges([CpG loci])
read.bismark(files, gr)

iscream overview

iscream overview

Implementation

  • Querying and region operations written in Cpp
  • Queries made with htslib
  • Multithreaded across input files

Usage

data.frame <- region_map(
  tabixed_bedfiles_vec,
  genomic_interval_string_vec,
  fun = "aggregate", # or average
  bismark = FALSE,   # TRUE if biskmark .cov
  mval = TRUE        # FALSE for beta value
)
BSseq <- query_all(
  tabixed_bedfiles_vec,
  genomic_interval_string_vec,
  bismark = FALSE,
  merged = TRUE, # FALSE for uncollapsed strands
)
library(iscream)
options("iscream.threads" = 8)
files <- list.files(
  "aligner_output",
  pattern = "*.bed.gz",
  full.names = TRUE
)
regions <- (
  data.table::fread("windows.bed")
  [, paste0(V1, ":", V2, "-", V3)]
)
region_map(files, regions, fun = "aggregate")
query_all(files, regions)

Benchmarks

Runtime

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

Memory usage

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

Runtime

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Future directions

  • Fast reading of full files
  • Support for converting to and reading from arrow files and datasets
  • Comprehensive data checks
  • Sparse matrix support

Summary

  • iscream is an R package to read WGBS data into analysis-ready data structures
  • Can query genomic regions to extract CpGs only from regions of interest
  • Collect all CpGs into matrices or aggregate/average them within windows
  • Queries are parallelized across files

Acknowledgements

Shen lab

Ian Beddows
Svetlana Djirackor
Ben Johnson
Kelly Kryzanowski
Paige Matusiak
Amy Nuffesse (Admin)
Mary Rhodes
Ayush Semwal
Tashi Zandstra


image/svg+xml

github.com/huishenlab/iscream

References

Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83. https://doi.org/10.1186/gb-2012-13-10-r83
Kapourani, C.-A., Argelaguet, R., Sanguinetti, G., & Vallejos, C. A. (2021). scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology, 22(1), 114. https://doi.org/10.1186/s13059-021-02329-8
Korthauer, K., Chakraborty, S., Benjamini, Y., & Irizarry, R. A. (2019). Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics, 20(3), 367–383. https://doi.org/10.1093/biostatistics/kxy007
Krueger, F., & Andrews, S. R. (2011). Bismark: A flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571–1572. https://doi.org/10.1093/bioinformatics/btr167
Morrison, J., Zhou, W., Johnson, B. K., & Shen, H. (2023). Dupsifter: A lightweight duplicate marking tool for whole genome bisulfite sequencing. Bioinformatics, 39(12), btad729. https://doi.org/10.1093/bioinformatics/btad729
Peters, T. J., Buckley, M. J., Chen, Y., Smyth, G. K., Goodnow, C. C., & Clark, S. J. (2021). Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate. Nucleic Acids Research, 49(19), e109–e109. https://doi.org/10.1093/nar/gkab637
Zhou, W., Johnson, B. K., Morrison, J., Beddows, I., Eapen, J., Katsman, E., Semwal, A., Habib, W. A., Heo, L., Laird, P. W., Berman, B. P., Triche, T. J., Jr., & Shen, H. (2024). BISCUIT: An efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies. Nucleic Acids Research, gkae097. https://doi.org/10.1093/nar/gkae097